Skip to content

Fix LLM evaluator skip-handling: status-first guard + pre-validation#5116

Open
AliMahmoudzadeh wants to merge 1 commit into
mainfrom
fix/llm-evaluator-skip-handling-status-first
Open

Fix LLM evaluator skip-handling: status-first guard + pre-validation#5116
AliMahmoudzadeh wants to merge 1 commit into
mainfrom
fix/llm-evaluator-skip-handling-status-first

Conversation

@AliMahmoudzadeh
Copy link
Copy Markdown
Contributor

Summary

Two related defects across the LLM-based evaluators were identified through end-to-end testing against the np-canary Foundry project (using minimal_repro_dataset.jsonl and minimal_repro_dataset_empty.jsonl):

  1. Skip-handling fragilitycoherence / fluency / retrieval post-process the dict returned by _the_super_do_eval and then run math.isnan(result.get(self._result_key, 0)). PR Fix TypeError when evaluator returns None score for skipped/not-applicable results #5107 added the _score is not None guard that prevents the production TypeError: must be real number, not NoneType, but the structure is fragile: any future override that forgets the guard re-introduces the bug. task_completion (and relevance) avoid this entire class of defect by checking status == "skipped" before any numeric coercion runs.

  2. No input pre-validation on retrieval — empty/missing query/context is silently sent to the LLM, which returns status: skipped, demoting the row to not_applicable instead of raising a clean USER_ERROR. (groundedness/coherence/fluency/relevance already have this via self._validator.)

Changes (3 files, +43 lines)

File Change
coherence/evaluator/_coherence.py Add status-first early-return in _do_eval before math.isnan
fluency/evaluator/_fluency.py Same
retrieval/evaluator/_retrieval.py Same + new _validate_required_inputs invoked at top of _real_call

The status-first pattern (the canonical example lives in task_completion._parse_prompty_output):

`python
result = await self._the_super_do_eval(eval_input)

Honor explicit skip status before any numeric validation. Structurally safer

than relying solely on the None-guard below.

if result.get(f"{self._result_key}_status") == "skipped":
return result
_score = result.get(self._result_key, 0)
if _score is not None and math.isnan(_score):
raise EvaluationException(...)
`

The None-guard from #5107 is retained as a backstop. The skipped path can no longer reach math.isnan at all.

Out of scope

  • relevance — already has both patterns inline. No change required.
  • groundedness — has the same outer post-check shape as coherence/fluency/retrieval and would benefit from the same W4 refactor; left out per agreed scope (this PR is the four highest-risk evaluators only).
  • The remaining ~14 LLM evaluators in the catalog need pre-validation as well; tracked separately.
  • Republishing fixed evaluators to the azureml registry (operational, not a code change).
  • Math-based evaluators (bleu/gleu/meteor/rouge/f1/task_navigation_efficiency) need typed input validation; tracked separately.

Validation

  • python scripts/validation/code_health.py -i assets/evaluators/builtin — clean.
  • pytest on test_coherence_evaluator_behavior.py + test_fluency_evaluator_behavior.py + test_relevance_evaluator_behavior.py199 passed.

Three LLM-based evaluators (coherence, fluency, retrieval) post-process the
dict returned by `_the_super_do_eval` and then run `math.isnan` on the
score. PR #5107 added a `_score is not None` guard to prevent
`TypeError: must be real number, not NoneType` on rows where the base
helper produced a skipped/not-applicable result.

The None-guard works, but it is fragile: any future override that forgets
the `is not None` check re-introduces the bug. `task_completion` avoids
this entire failure class by checking `status == "skipped"` *before* any
numeric coercion runs (gold-standard pattern). This change adopts that
ordering for coherence, fluency, and retrieval:

    result = await self._the_super_do_eval(eval_input)
    if result.get(f"{self._result_key}_status") == "skipped":
        return result
    _score = result.get(self._result_key, 0)
    if _score is not None and math.isnan(_score):
        raise EvaluationException(...)

The None-guard is retained as a backstop. Now `math.isnan` is structurally
unreachable on the skipped path.

Additionally, `retrieval` previously had no input pre-validation: empty
or missing `query`/`context` would be silently sent to the LLM, which
would return `status: skipped`, demoting the row to `not_applicable`
with no proactive USER_ERROR signal. This change adds a
`_validate_required_inputs` helper invoked at the top of `_real_call`
that raises `EvaluationException(USER_ERROR, MISSING_FIELD)` when
`conversation` is absent and either `query` or `context` is empty,
matching the pattern already used by `groundedness`,
`coherence`/`fluency`/`relevance` (via `self._validator`).

No changes needed for `relevance`: its `_do_eval` already implements the
status-first pattern inline, and `_real_call` already invokes
`self._validator.validate_eval_input`.

Behavior tests (199) for coherence, fluency, relevance all pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 5, 2026

Test Results for assets-test

139 tests   139 ✅  4s ⏱️
  2 suites    0 💤
  2 files      0 ❌

Results for commit 68947a7.

♻️ This comment has been updated with latest results.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant